Extract Content (Web Mining)

Synopsis

Extracts content from an HTML document.

Description

This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given minimum number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
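The filtering idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the operator's actual implementation: the class `BlockCollector`, the function `extract_text_blocks`, and the default of 5 words are hypothetical, and the sketch treats every tag as a block divider.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Hypothetical helper: collects text between tags as separate blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def _flush(self):
        # Normalise whitespace and store the block if it is not empty.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        self._flush()  # assumption: every tag ends the current block
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        self._flush()
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:  # ignore script and style content
            self._buf.append(data)


def extract_text_blocks(html, minimum_text_block_length=5):
    """Return text blocks with at least the given number of words.

    The default of 5 is an arbitrary value chosen for this sketch.
    """
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector._flush()  # keep any trailing text after the last tag
    return [b for b in collector.blocks
            if len(b.split()) >= minimum_text_block_length]


if __name__ == "__main__":
    page = ("<html><body><ul><li>Home</li><li>About</li></ul>"
            "<p>This operator extracts textual content from a page.</p>"
            "</body></html>")
    print(extract_text_blocks(page))
```

In this sketch the one-word navigation entries "Home" and "About" fall below the threshold and are dropped, while the sentence in the paragraph is kept.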

Input

  • document

    The document port.

Output

  • document

    The document port.

Parameters

  • extract_content: Specifies whether content is extracted or not.
  • minimum_text_block_length: The minimum length (in words/tokens) of text blocks.
  • override_content_type_information: Specifies whether any existing content type and encoding information should be overridden with the values from the HTML meta http-equiv tag.
  • neglegt_span_tags: Specifies whether <span> tags should be neglected or used as text block dividers.
  • neglect_p_tags: Specifies whether <p> tags should be neglected or used as text block dividers.
  • neglect_b_tags: Specifies whether <b> tags should be neglected or used as text block dividers.
  • neglect_i_tags: Specifies whether <i> tags should be neglected or used as text block dividers.
  • neglect_br_tags: Specifies whether <br> tags should be neglected or used as text block dividers.
  • ignore_non_html_tags: Specifies whether tags that are not common HTML should be ignored.
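
The effect of the neglect parameters can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the operator's real logic: `DividerAwareCollector` and `split_blocks` are hypothetical names, and a neglected tag is simply skipped so that the text on both sides of it stays in one block.

```python
from html.parser import HTMLParser


class DividerAwareCollector(HTMLParser):
    """Hypothetical helper: splits text at tags unless they are neglected."""

    def __init__(self, neglected_tags):
        super().__init__()
        self.neglected = set(neglected_tags)
        self.blocks = []
        self._buf = []

    def _boundary(self, tag):
        # A neglected tag does not divide text blocks.
        if tag in self.neglected:
            return
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self._buf.append(data)


def split_blocks(html, neglected_tags=()):
    """Return the text blocks of html, treating non-neglected tags as dividers."""
    collector = DividerAwareCollector(neglected_tags)
    collector.feed(html)
    collector.close()
    collector._boundary(None)  # flush trailing text
    return collector.blocks


if __name__ == "__main__":
    snippet = "<p>Text with <b>bold</b> words inside</p>"
    print(split_blocks(snippet))                        # <b> divides the text
    print(split_blocks(snippet, neglected_tags={"b"}))  # <b> is neglected
```

With <b> used as a divider the sentence is split into three short blocks, each of which could then fall below minimum_text_block_length; with <b> neglected it remains a single block.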